Forecasting COVID-19 activity in Australia to support pandemic response: May to October 2020

As of January 2021, Australia had effectively controlled local transmission of COVID-19 despite a steady influx of imported cases and several local, but contained, outbreaks in 2020. Throughout 2020, state and territory public health responses were informed by weekly situational reports that included an ensemble forecast of daily COVID-19 cases for each jurisdiction. We present here an analysis of one forecasting model included in this ensemble across the variety of scenarios experienced by each jurisdiction from May to October 2020. We examine how successfully the forecasts characterised future case incidence, subject to variations in data timeliness and completeness, showcase how we adapted these forecasts to support decisions of public health priority in rapidly evolving situations, evaluate the impact of key model features on forecast skill, and demonstrate how to assess forecast skill in real-time before the ground truth is known. Conditioning the model on the most recent, but incomplete, data improved the forecast skill, emphasising the importance of developing strong quantitative models of surveillance system characteristics, such as ascertainment delay distributions. Forecast skill was highest when there were at least 10 reported cases per day, the circumstances in which authorities were most in need of forecasts to aid in planning and response.

S.1 Forecast sharpness

Forecasts usually grow more uncertain the further ahead they predict [1]. However, the forecasts presented here tended to exhibit little growth in uncertainty at longer lead times, or even to grow more certain (Figure S2). During the second wave in Victoria, for example, strong public health measures and social restrictions markedly reduced the transmission potential (TP), which caused forecasts to become sharper at longer lead times.
For jurisdictions that primarily reported zero cases on a given day, the absence of epidemic growth meant that the forecast trajectories tended to local extinction and did not allow for a subsequent rebound. Accordingly, forecast sharpness in these jurisdictions remained very close to zero (i.e., no uncertainty) for all lead times.
For jurisdictions that reported substantial COVID-19 activity, forecasts for longer lead times either (a) exhibited little increase in uncertainty (New South Wales, Queensland); or (b) grew more certain but did not approach zero (Victoria). These jurisdictions imposed intense mobility and gathering restrictions in response to local COVID-19 activity, and the reduced contact rates in the general population meant that the model reproduction number tended to decrease below unity over the forecast horizon.
The forecasts did, however, grow more uncertain as the daily number of cases increased (Figure S1), and forecasts generated under the assumption that transmission would be sustained at the current active-case R_eff grew more uncertain at longer lead times (Figure S3, green lines).
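This behaviour can be illustrated with a toy branching process (a minimal sketch of our own; the actual forecasting model, described in Section 4, is considerably more detailed): once the reproduction number falls below one, forecast trajectories are drawn toward extinction and the spread of plausible case counts narrows at longer lead times.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def simulate_daily_cases(r_eff_path, initial_cases, n_sims=1000):
    """Toy branching process: tomorrow's cases are Poisson-distributed
    with mean R_eff * today's cases. Purely illustrative.
    Returns an array of shape (len(r_eff_path), n_sims)."""
    cases = np.full(n_sims, initial_cases)
    trajectories = []
    for r in r_eff_path:
        cases = rng.poisson(r * cases)
        trajectories.append(cases)
    return np.array(trajectories)

# A reproduction number that declines below one over a four-week
# horizon, as occurred under the restrictions imposed during
# Victoria's second wave (values are illustrative only).
declining_r = np.linspace(1.2, 0.6, 28)
sims = simulate_daily_cases(declining_r, initial_cases=50)

# The spread of trajectories narrows at longer lead times, because
# most trajectories are drawn toward extinction once R_eff < 1.
print(sims.std(axis=1)[[0, 13, 27]])
```

By contrast, holding the reproduction number fixed above one in this sketch produces trajectories whose spread grows with lead time, consistent with the fixed active-case R_eff forecasts (Figure S3, green lines).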

S.2 Forecast bias
We defined forecast bias so that 0 indicated an unbiased forecast, whose probability mass was distributed equally above and below the data; −1 indicated a forecast that completely under-estimated the data; and 1 indicated a forecast that completely over-estimated the data. For jurisdictions that primarily reported zero cases on a given day (e.g., Australian Capital Territory, Northern Territory, South Australia, Tasmania, Western Australia), the forecast bias for days where one or more cases were reported was close to negative one (Figure S4). This reflects a deliberate model feature: our careful separation of the potential for reintroduction from the consequences of reintroduction (Section 4.3). For jurisdictions that had achieved local elimination and for which there was no evidence of reintroduction, the forecasts did not allow for subsequent epidemic activity.
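As a concrete reading of this definition, a sample-based estimate of the bias for a single day might be computed as follows (a minimal sketch under the stated definition; the function name and the tie-handling convention are our assumptions, not taken from the paper):

```python
import numpy as np

def forecast_bias(samples: np.ndarray, observed: float) -> float:
    """Sample-based forecast bias for a single day.

    Returns a value in [-1, 1]: 0 when the forecast's probability mass
    is split equally above and below the observation, -1 when every
    sample falls below it (complete under-estimation), and +1 when
    every sample falls above it (complete over-estimation). Samples
    equal to the observation contribute zero.
    """
    samples = np.asarray(samples)
    return float(np.mean(samples > observed) - np.mean(samples < observed))
```

For example, a forecast whose samples are all below the reported count, such as `forecast_bias(np.array([0, 0, 1, 2]), 5)`, returns −1, matching the behaviour described above for elimination-phase jurisdictions on days where cases were reported.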
In jurisdictions that saw substantial COVID-19 activity (New South Wales, Queensland, Victoria), the forecast bias was also negative and grew in magnitude as the lead time increased, but did not approach negative one (Figure S4). Forecasts exhibited a small positive bias on days where zero cases were reported (Figure S5, left-most category), because the forecasts always included some uncertainty and did not allow negative case counts. As the number of daily cases increased, the forecast bias became negative and grew in magnitude (Figure S5). When we consider the pre-peak and post-peak intervals of the second wave in Victoria and in New South Wales (the only period in which any jurisdiction reported more than 100 local cases in a single day), we find that the forecast bias was negative prior to the peak and approximately zero after the peak (Figure S6 and Figure S7). This suggests that the negative bias was due to a combination of factors.
During the second wave, the active-case R_eff was consistently higher than the whole-population transmission potential. So while the restrictions imposed during the second wave did reduce the active-case R_eff over time, and eventually brought the second wave to an end, our model over-estimated the decrease in local transmission over the forecast horizon. Despite this negative bias, the model consistently out-performed forecasts generated under the assumption that local transmission would be sustained at the current active-case R_eff (see Section S.3 for details). Figures S6 and S7 show a general trend to under-estimate case counts (negative bias, blue) prior to the peak and to over-estimate case counts (positive bias, red) after the peak, although forecasts made in the few weeks prior to the observed peaks also exhibit positive bias.

We considered forecast skill as a function of the number of cases ultimately recorded on each day in each jurisdiction (comprising 1,688 observations). We divided the daily case numbers into five bins: 0 cases (n = 1,298; 76.9%); 1-4 cases (n = 186; 11.0%); 5-9 cases (n = 57; 3.4%); 10-99 cases (n = 91; 5.4%); and 100+ cases (n = 56; 3.3%). The forecasts always out-performed the historical benchmark. Forecast skill across all jurisdictions was highest when daily case numbers were zero or ≥ 10, and was lower on days where 1-9 cases were reported (Figure S8). Using our model to predict how transmission would change over the forecast horizon (relative to the current active-case R_eff estimate) greatly improved forecast performance (Figure S9, compare blue lines to green lines).

We estimated the delay distribution from symptom onset to case notification, and imputed symptom onset dates for cases where this field was not yet completed (see Section 4.1). The forecasting model was then conditioned on case counts, after accounting for under-reporting, for those dates where at least 50% of cases were estimated to have been reported (see Section 4.2). To determine how this treatment of the case data affected forecast skill, we used a separate "minimal imputation" model that was only conditioned on case counts for dates where at least 95% of cases were estimated to have been reported; a sketch of this selection rule appears below. We calculated CRPS skill scores for the "minimal imputation" model, relative to the original model, for each of the 7,488 predictions (8 jurisdictions, 26 data dates, 36-day forecasting window). The cumulative distribution of these skill scores is shown in Figure S10.
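To make the 50%/95% rule concrete, the following sketch selects the onset dates eligible for conditioning, given a fitted onset-to-notification delay distribution (the distribution family, parameter values, and all names here are hypothetical illustrations, not taken from the paper):

```python
import numpy as np
from scipy import stats

# Hypothetical onset-to-notification delay distribution (in days);
# the paper estimates this from the data (Section 4.1), so these
# parameters are purely illustrative.
delay_dist = stats.gamma(a=2.0, scale=2.5)

def conditioning_dates(onset_dates, data_date, threshold=0.5):
    """Select the onset dates on which the model may be conditioned.

    The expected proportion of cases reported by `data_date` is the
    delay CDF evaluated at the days elapsed since symptom onset; a
    date qualifies when this proportion is at least `threshold`
    (0.5 for the default model, 0.95 for "minimal imputation").
    """
    onset_dates = np.asarray(onset_dates, dtype="datetime64[D]")
    days_elapsed = (np.datetime64(data_date, "D") - onset_dates).astype(int)
    prop_reported = delay_dist.cdf(days_elapsed)
    return onset_dates[prop_reported >= threshold]
```

Raising the threshold from 0.5 to 0.95 excludes the most recent onset dates, which is precisely why the minimal imputation model was not conditioned on the latest (incomplete) case counts.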
The minimal imputation model out-performed the default model for 19.3% of the predictions (median skill score of 0.09), yielded identical performance for 47.6% of the predictions, and was out-performed by the default model for 33.2% of the predictions (median skill score of -0.32). Identical performance occurred almost exclusively on days with zero cases (98.1%), with the remainder on days with 1-3 cases. The mean skill score was -0.52, indicating that imputing symptom onset dates and accounting for delayed ascertainment provided information beyond that available in the raw time-series data.
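These scores can be reproduced in outline as follows (a sketch only: the empirical CRPS below uses the standard energy form, and the relative skill score uses the common 1 − CRPS ratio convention, which is our assumption; the paper's own definitions are given in its methods):

```python
import numpy as np

def crps_sample(samples: np.ndarray, observed: float) -> float:
    """Empirical CRPS for a sample-based forecast of one observation,
    using the energy form: E|X - y| - 0.5 * E|X - X'|."""
    samples = np.asarray(samples, dtype=float)
    term1 = np.abs(samples - observed).mean()
    term2 = 0.5 * np.abs(samples[:, None] - samples[None, :]).mean()
    return term1 - term2

def skill_score(crps_forecast: float, crps_reference: float) -> float:
    """Skill of a forecast relative to a reference forecast; positive
    when the forecast out-performs the reference. When both forecasts
    score a CRPS of zero (e.g., both correctly predict a certain
    zero-case day) we treat their performance as identical."""
    if crps_reference == 0.0:
        return 0.0 if crps_forecast == 0.0 else float("-inf")
    return 1.0 - crps_forecast / crps_reference
```

Under this convention, zero-case days on which both models issue a point-mass forecast at zero yield identical performance, consistent with the 47.6% of predictions noted above.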
There were 25 predictions for which the skill score was at least 0.9; 23 of these pertained to a single data date for New South Wales (17 June) or Queensland (6 May).
For New South Wales, the default model was conditioned on data for several days where 1 local case was reported, while the minimal imputation model was not conditioned on these data (because fewer than 95% of cases were estimated to have been reported for these dates). Accordingly, the default model predicted a subsequent increase in local cases, while the minimal imputation model predicted no local cases (95% CrI: 0-1 cases).
These data were revised in subsequent data extracts, and each of the days originally reported as having 1 local case was revised to zero local cases.
The same explanation applies to Queensland, where only 2 cases were reported over the forecast horizon. The default model was conditioned on data for several days where 1 local case was reported, and predicted a larger increase in cases than the minimal imputation model, which was not conditioned on these data. Both sets of forecasts had identical medians (0 cases) and 50% CrIs (0-0 cases), but the default model had wider 95% CrIs (0-18 cases) than the minimal imputation model (0-4 cases). Overall, conditioning on, and not conditioning on, the most recent case counts (shown in dark and light shades, respectively, in the accompanying figure) did not exhibit a consistent trend across jurisdictions.

We measured forecast skill relative to an historical benchmark forecast (see Section 4.4). Of the jurisdictions that primarily reported zero cases on a given day, the forecasts consistently out-performed the historical benchmark at all lead times in the Australian Capital Territory and Western Australia, and out-performed it at lead times of up to one week in South Australia and Tasmania (Figure S11). The Northern Territory reported only three cases in the first wave and three cases in the study period (linked to separate importation events); these were ideal circumstances for the historical benchmark, which out-performed the model forecasts. For Tasmania, which experienced a local outbreak in the first wave and reported only two cases in the study period, the forecasts exhibited improved performance at lead times of up to one week, and decreased performance at longer lead times.
Of the three jurisdictions that experienced substantial COVID-19 activity (New South Wales, Queensland, Victoria), the forecasts exhibited improved performance at shorter lead times (Figure S11). The forecasts for Victoria, which experienced the largest local outbreak, substantially out-performed the historical benchmark at even the longest lead times. For New South Wales and Victoria, the model consistently out-performed forecasts generated under the assumption that local transmission would be sustained at the current active-case R_eff (Figure S12).

During the second wave (June to September, inclusive; 122 days), New South Wales reported 1-9 cases on 62 days (50.8%), the case levels at which forecast skill was lowest across all jurisdictions (Figure S11). In contrast, Victoria reported 1-9 cases on only 14 days (11.5%), and reported at least 100 cases on 56 days (45.9%). Over this period, the New South Wales forecasts out-performed the historical benchmark at shorter lead times (up to one week ahead), while the Victorian forecasts consistently out-performed the historical benchmark at all lead times (Figure S13). Using our model to predict how transmission would change over the forecast horizon (relative to the current active-case R_eff estimate) greatly improved forecast performance (Figure S14, compare blue lines to green lines).

Figure S15: CRPS skill scores binned by the median forecast: 0 cases (81.2% of forecast days); 1-4 cases (6.9%); 5-9 cases (3.5%); 10-99 cases (5.5%); and 100+ cases (2.9%).

When considering forecast skill as a function of the number of cases ultimately recorded for each day, as above, the results can only be evaluated in retrospect, since these numbers are unknown when the forecast is made. Accordingly, we also considered forecast skill as a function of the number of daily cases predicted by the forecast model, which may assist in interpreting forecast outputs in near-real-time. Our intent was to address the question "what does this forecast mean?" before the ground truth is known.

S.3.5 Forecast skill by predicted local cases
The study period comprised 26 data dates across 8 jurisdictions, and each forecast spanned 36 days, yielding 7,488 predictions of daily case counts. We binned these predictions by their median forecasts, which were correlated with forecast sharpness (Figure S1) and therefore also indicative of forecast uncertainty, using the same bins as before: 0 cases; 1-4 cases; 5-9 cases; 10-99 cases; and 100+ cases (a sketch of this binning appears after this paragraph). As with the skill scores aggregated by the number of cases ultimately recorded on each day, the forecasts always out-performed the historical benchmark. Forecast skill was lowest when the median prediction was zero cases, and highest when the median prediction was ≥ 100 cases, followed by 1-4 cases (Figure S15). Forecasts generated under the assumption that transmission would be sustained at the current active-case R_eff exhibited similar skill to the default model forecasts when the predicted case numbers were low, but their skill decreased as the predicted case numbers increased (Figure S16).
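Because the median prediction is available as soon as a forecast is generated, this binning can be applied in near-real-time. A minimal sketch of the aggregation (the table layout and column names are hypothetical):

```python
import numpy as np
import pandas as pd

def skill_by_median_prediction(predictions: pd.DataFrame) -> pd.Series:
    """Bin predictions by their median forecast and summarise the CRPS
    skill score within each bin. `predictions` is assumed to hold one
    row per (jurisdiction, data date, horizon day), with columns
    'median_forecast' and 'skill_score'."""
    bins = [0, 1, 5, 10, 100, np.inf]
    labels = ["0", "1-4", "5-9", "10-99", "100+"]
    binned = pd.cut(predictions["median_forecast"], bins=bins,
                    labels=labels, right=False)
    return predictions.groupby(binned, observed=True)["skill_score"].median()
```

The same function, applied instead to the number of cases ultimately recorded, reproduces the retrospective aggregation described earlier, making the two views directly comparable.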
In comparison to the skill scores aggregated by the number of cases ultimately recorded on each day, the skill scores aggregated by the predicted case numbers exhibit similar trends for ≥ 5 cases (Figure S17). Relative to a median prediction of zero cases, the skill score is higher when zero cases are in fact reported; and relative to a median prediction of 1-4 cases, the skill score is lower when 1-4 cases are reported. This suggests that, given a near-real-time forecast for which future case counts are truly unknown, we can plausibly estimate the likely forecast skill when there is a moderate degree of local transmission.

Figure S18: The R_eff trajectories for Victoria, shown separately for each data date (right-hand axis labels). Each set of trajectories is shown as a spread of quantiles (blue shaded regions). Vertical dashed lines indicate the data date, and horizontal dark yellow lines indicate R_eff = 1.

Figure S20: The proportion of Flutracking [2,3] participants who reported ILI symptoms that (a) sought healthcare advice due to these symptoms (green lines); and (b) had a COVID-19 test (orange lines). The average number of participants (n) who reported ILI symptoms to Flutracking each week is listed in the sub-plot captions. For comparison, in recent influenza seasons only 3-8% of participants who reported ILI symptoms had an influenza test [4].